AITopics | Saint Croix Island

Collaborating Authors

Saint Croix Island

Revisiting Noise in Natural Language Processing for Computational Social Science

arXiv.org Artificial IntelligenceMar-10-2025

Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.

camembert-ft-sq-fr camembert-ft-sq-fr 54 54 52, convenient qualitative analysis and visualisation, hedonism pleasure and sensuous gratification, (16 more...)

arXiv.org Artificial Intelligence

2503.07395

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Poland (0.14)
Europe > Finland (0.14)
(130 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
(2 more...)

Industry:

Media > News (1.00)
Leisure & Entertainment (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(10 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

Add feedback

A Multilingual Perspective on Probing Gender Bias

Stańczak, Karolina

arXiv.org Artificial IntelligenceMar-15-2024

Gender bias represents a form of systematic negative treatment that targets individuals based on their gender. This discrimination can range from subtle sexist remarks and gendered stereotypes to outright hate speech. Prior research has revealed that ignoring online abuse not only affects the individuals targeted but also has broader societal implications. These consequences extend to the discouragement of women's engagement and visibility within public spheres, thereby reinforcing gender inequality. This thesis investigates the nuances of how gender bias is expressed through language and within language technologies. Significantly, this thesis expands research on gender bias to multilingual contexts, emphasising the importance of a multilingual and multicultural perspective in understanding societal biases. In this thesis, I adopt an interdisciplinary approach, bridging natural language processing with other disciplines such as political science and history, to probe gender bias in natural language and language models.

background information and human perception, gender inequality and discriminatory practice, multidimensional and multilingual approach, (16 more...)

arXiv.org Artificial Intelligence

2403.10699

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.27)
Asia > Middle East > Iraq (0.27)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(119 more...)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
(2 more...)

Industry:

Media > News (1.00)
Leisure & Entertainment (1.00)
Law > Civil Rights & Constitutional Law (1.00)
(12 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(9 more...)

Add feedback

PHD: Pixel-Based Language Modeling of Historical Documents

Borenstein, Nadav, Rust, Phillip, Elliott, Desmond, Augenstein, Isabelle

arXiv.org Artificial IntelligenceNov-4-2023

The digitisation of historical documents has provided historians with unprecedented research opportunities. Yet, the conventional approach to analysing historical documents involves converting them from images to text using OCR, a process that overlooks the potential benefits of treating them as images and introduces high levels of noise. To bridge this gap, we take advantage of recent advancements in pixel-based language models trained to reconstruct masked patches of pixels instead of predicting token distributions. Due to the scarcity of real historical scans, we propose a novel method for generating synthetic scans to resemble real historical documents. We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. Through our experiments, we demonstrate that PHD exhibits high proficiency in reconstructing masked image patches and provide evidence of our model's noteworthy language understanding capabilities. Notably, we successfully apply our model to a historical QA task, highlighting its usefulness in this domain.

dataset, historical document, phd, (14 more...)

arXiv.org Artificial Intelligence

2310.18343

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom (0.06)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry: Media (0.37)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Measuring Intersectional Biases in Historical Documents

Borenstein, Nadav, Stańczak, Karolina, Rolskov, Thea, Perez, Natália da Silva, Käfer, Natacha Klein, Augenstein, Isabelle

arXiv.org Artificial IntelligenceMay-21-2023

Data-driven analyses of biases in historical texts can help illuminate the origin and development of biases prevailing in modern society. However, digitised historical documents pose a challenge for NLP practitioners as these corpora suffer from errors introduced by optical character recognition (OCR) and are written in an archaic language. In this paper, we investigate the continuities and transformations of bias in historical newspapers published in the Caribbean during the colonial era (18th to 19th centuries). Our analyses are performed along the axes of gender, race, and their intersection. We examine these biases by conducting a temporal study in which we measure the development of lexical associations using distributional semantics models and word embeddings. Further, we evaluate the effectiveness of techniques designed to process OCR-generated data and assess their stability when trained on and applied to the noisy historical newspapers. We find that there is a trade-off between the stability of the word embeddings and their compatibility with the historical dataset. We provide evidence that gender and racial biases are interdependent, and their intersection triggers distinct effects. These findings align with the theory of intersectionality, which stresses that biases affecting people with multiple marginalised identities compound to more than the sum of their constituents.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.12376

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Trinidad and Tobago (0.14)
North America > Antigua and Barbuda (0.14)
(69 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

Add feedback

Est-ce que vous compute? Code-switching, cultural identity, and AI

Falbo, Arianna, LaCroix, Travis

arXiv.org Artificial IntelligenceDec-15-2021

Cultural code-switching concerns how we adjust our overall behaviours, manners of speaking, and appearance in response to a perceived change in our social environment. We defend the need to investigate cultural code-switching capacities in artificial intelligence systems. We explore a series of ethical and epistemic issues that arise when bringing cultural code-switching to bear on artificial intelligence. Building upon Dotson's (2014) analysis of testimonial smothering, we discuss how emerging technologies in AI can give rise to epistemic oppression, and specifically, a form of self-silencing that we call 'cultural smothering'. By leaving the socio-dynamic features of cultural code-switching unaddressed, AI systems risk negatively impacting already-marginalised social groups by widening opportunity gaps and further entrenching social inequalities.

algorithm, oppression, social environment, (15 more...)

arXiv.org Artificial Intelligence

2112.08256

Country:

North America > United States > New York (0.05)
North America > US Virgin Islands > Saint Croix Island > Saint Croix (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(4 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Industry:

Law (1.00)
Leisure & Entertainment > Games (0.68)
Media > Film (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback